# Intro
Repository associated with the paper "Into the LAION's Den: Investigating Hate in Multimodal Datasets".

## Here's the directory structure:

suppl_neurips_23 
├── code
│   ├── 1_Pysentimiento_400M_2Ben.ipynb
│   ├── 2_Walkthrough_Pysentimiento_400M_2Ben.ipynb
│   └── 3_NSFW_Pysentimiento_2Ben.ipynb
├── data
│   └── nlp_hate
│       ├── df_parquet_400m_2b.csv
│       ├── df_hcr_filewise_400M_2B.csv
│       ├── df_summary_filewise_400M_2B.csv
│       └── README.md
├── plots
│   ├── filewise_violin.pdf
│   ├── filewise_violin.png
│   ├── n_excess.pdf
│   ├── n_excess.png
│   ├── p_detect.pdf
│   └── p_detect.png
└── README.md


## Walkthrough

The code is in the form of three Jupyter notebooks:

├── 1_Pysentimiento_400M_2Ben.ipynb: The foundational notebook that generates Fig. 1, Fig. 2, and Table 2 in the paper
├── 2_Walkthrough_Pysentimiento_400M_2Ben.ipynb: A walkthrough notebook that helps an end user navigate and use the (meta)data assets curated in this research work
└── 3_NSFW_Pysentimiento_2Ben.ipynb: The notebook containing the NSFW analysis, which generates Table 2 in the paper

## Self-declared limitations on reproducibility

a) Notebooks 1 and 2 above were run on Colab Pro+ instances, which cost the author $49.99 at the time of submission.
b) This afforded the author VM instances with 54-to-89 GB of RAM and an A100 GPU, which made it possible to load certain data assets and use inference chunk sizes that might otherwise trigger OOM (Out of Memory) errors.
c) Because of their sheer memory footprint, certain data assets, such as the hate_detect_laion_400m_2B-en.zip file used in the walkthrough notebook (2_Walkthrough_Pysentimiento_400M_2Ben.ipynb), are stored on academic servers owned by the institutions that the authors are affiliated with. We have replaced the links to them with what are currently dead links, as we realized that one could theoretically deanonymize our affiliations using certain clues in the persistent links' URL(s).
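
For readers on smaller instances, the usual workaround for the OOM risk mentioned in (b) is to lower the inference chunk size. The following is a minimal sketch of that idea, not the notebooks' actual code: `iter_chunks`, `score_in_chunks`, `score_fn`, and the default `chunk_size` are all hypothetical names and values chosen for illustration, and `score_fn` stands in for whatever model call produces one score per caption.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(items: Sequence, chunk_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def score_in_chunks(
    texts: Sequence[str],
    score_fn: Callable[[Sequence[str]], List[float]],
    chunk_size: int = 1024,  # hypothetical default; lower it if memory runs out
) -> List[float]:
    """Apply `score_fn` chunk by chunk so only one chunk is processed at a time."""
    scores: List[float] = []
    for chunk in iter_chunks(texts, chunk_size):
        # score_fn is a placeholder for the real model inference call.
        scores.extend(score_fn(chunk))
    return scores
```

A smaller `chunk_size` trades throughput for a lower peak memory footprint; the results are the same as long as `score_fn` scores each text independently.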

## Explanation of the Data assets in the data directory

1: ```df_parquet_400m_2b.csv```: A 160 x 4 input dataframe of file names whose ```file_loc``` column has the precise order of the 160 (= 32 + 128) parquet files that will be used to index the results.

2: ```df_hcr_filewise_400M_2B.csv```: A 160 x 3 output hcr dataframe whose rows map to the 160 parquet files listed in the ```file_loc``` column of ```df_parquet_400m_2b.csv``` and whose columns hold the hcr-at-0.5 values for ```[P_hateful,P_targeted,P_aggressive]```.

3: ```df_summary_filewise_400M_2B.csv```: A (160, 8) shaped summary dataframe that concatenates ```df_parquet_400m_2b.csv``` and ```df_hcr_filewise_400M_2B.csv``` and adds a ```file_ind``` column that allows you to index the 640 meta-data assets downloaded in (1) above from [here](https://anonymous.edu/assets/data/papers/hate_detect_laion_400m_2B-en.zip).
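
The relationship between the three files can be sketched with pandas as follows. This is an illustrative reconstruction under stated assumptions, not the code used to produce the shipped CSVs: it assumes that row i of both input files refers to the same parquet file and that ```file_ind``` is simply the 0-based row position. The helper names `build_summary` and `top_files` are hypothetical.

```python
import pandas as pd


def build_summary(df_files: pd.DataFrame, df_hcr: pd.DataFrame) -> pd.DataFrame:
    """Concatenate the file list and the per-file hcr table column-wise.

    Assumption: both frames have one row per parquet file, in the same order,
    and file_ind is the 0-based row position.
    """
    if len(df_files) != len(df_hcr):
        raise ValueError("both frames must have one row per parquet file")
    summary = pd.concat(
        [df_files.reset_index(drop=True), df_hcr.reset_index(drop=True)],
        axis=1,
    )
    summary["file_ind"] = list(range(len(summary)))
    return summary


def top_files(summary: pd.DataFrame, column: str = "P_hateful", n: int = 5) -> pd.DataFrame:
    """Return the n rows with the largest value in `column`."""
    return summary.nlargest(n, column)


# Example usage with the paths from the directory structure above:
# df_files = pd.read_csv("data/nlp_hate/df_parquet_400m_2b.csv")
# df_hcr = pd.read_csv("data/nlp_hate/df_hcr_filewise_400M_2B.csv")
# summary = build_summary(df_files, df_hcr)
# print(top_files(summary, "P_hateful", n=5)[["file_loc", "P_hateful"]])
```

In practice, reading ```df_summary_filewise_400M_2B.csv``` directly is simpler; the sketch only makes explicit how its columns relate to the other two files.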
